A Florida health insurance company wants to predict annual claims for individual clients. The company pulls a random sample of 50 customers. The owner wishes to charge an actuarially fair premium to ensure a normal rate of return. The owner collects all of the current customers’ health care expenses from the last year and compares them with what is known about each customer’s plan.

The data on the 50 customers in the sample is as follows:

  • Charges: Total medical expenses for a particular insurance plan (in dollars)
  • Age: Age of the primary beneficiary
  • BMI: Primary beneficiary’s body mass index (kg/m²)
  • Female: Primary beneficiary’s birth sex (0 = Male, 1 = Female)
  • Children: Number of children covered by health insurance plan (includes other dependents as well)
  • Smoker: Indicator if primary beneficiary is a smoker (0 = non-smoker, 1 = smoker)
  • Cities: Dummy variables for each city with the default being Sanford

Answer the following questions using complete sentences and attach all output, plots, etc. within this report.

Question 1

Randomly select three observations from the sample and exclude them from all modeling (i.e., n = 47). Provide the summary statistics (min, max, std, mean, median) of the quantitative variables for the 47 observations.

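library(dplyr)      # provides %>%
library(gtsummary)  # provides tbl_summary()

set.seed(123457)
index <- sample(seq_len(nrow(insurance)), size = 3)  # rows held out for testing
insurance.new <- insurance[-index, ]
insurance.test <- insurance[index, ]
insurance_86dummy <- insurance.new[-c(4, 6, 7, 8, 9)]  # drop the dummy columns
insurance_86dummy[, c(1:3)] %>%  # min, max, SD, median, and mean for Charges, Age, BMI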
  tbl_summary(statistic = list(all_continuous() ~ c("{mean} ({sd})",
                                                    "{median} ({p25}, {p75})",
                                                    "{min}, {max}"),
                              all_categorical() ~ "{n} / {N} ({p}%)"),
              type = all_continuous() ~ "continuous2"
  )
Characteristic N = 47
Charges
Mean (SD) 12,317 (11,498)
Median (IQR) 8,604 (4,480, 13,552)
Range 2,494, 55,135
Age
Mean (SD) 42 (13)
Median (IQR) 43 (30, 53)
Range 23, 64
BMI
Mean (SD) 29.0 (5.6)
Median (IQR) 28.5 (25.3, 32.4)
Range 16.8, 42.1

Children Summary Data

summary(insurance_86dummy$Children)#summary information for children with Sd below
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   1.000   1.234   2.000   5.000

Children Standard Deviation

sd(insurance_86dummy$Children)
## [1] 1.18345
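
The five requested statistics can also be collected into a single table with base R, so that Children does not need separate summary() and sd() calls. The sketch below is only an illustration: the data frame `toy` and its values are made up and stand in for insurance_86dummy.

```r
# Hypothetical stand-in for insurance_86dummy (values are made up)
toy <- data.frame(
  Charges  = c(2500, 8600, 13500, 4500, 55000),
  Age      = c(23, 43, 53, 30, 64),
  BMI      = c(16.8, 28.5, 32.4, 25.3, 42.1),
  Children = c(0, 1, 2, 0, 5)
)

# One row per variable, one column per statistic
summary_table <- t(sapply(toy, function(x) {
  c(min = min(x), max = max(x), sd = sd(x),
    mean = mean(x), median = median(x))
}))

print(round(summary_table, 2))
```

Replacing `toy` with the real 47-row data frame would give all four quantitative variables in the same layout.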

Question 2

Provide the correlation between all quantitative variables

library(corrplot)
corrplot(cor(insurance_86dummy),
         type = "lower", order = "hclust",
         tl.col = "black",
         tl.srt = 45,
         addCoef.col = "black",
         diag = FALSE)

Question 3

Run a regression that includes all independent variables in the data table. Does the model above violate any of the Gauss-Markov assumptions? If so, what are they and what is the solution for correcting?

model <- lm(Charges ~., data = insurance.new)
summary(model)
## 
## Call:
## lm(formula = Charges ~ ., data = insurance.new)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -11888  -2726  -1065    711  20257 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -14022.39    6563.47  -2.136 0.039145 *  
## Age              287.26      77.04   3.729 0.000626 ***
## BMI              434.97     200.14   2.173 0.036058 *  
## Female           858.33    2120.59   0.405 0.687923    
## Children         118.17     873.64   0.135 0.893122    
## Smoker         23108.13    3009.97   7.677 3.04e-09 ***
## WinterSprings  -1659.04    3069.60  -0.540 0.592024    
## WinterPark     -4853.57    3009.55  -1.613 0.115080    
## Oviedo         -3769.38    2566.29  -1.469 0.150115    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6722 on 38 degrees of freedom
## Multiple R-squared:  0.7176, Adjusted R-squared:  0.6582 
## F-statistic: 12.07 on 8 and 38 DF,  p-value: 2.224e-08
par(mfrow=c(2,2))
plot(model)

scatterplotMatrix(insurance_86dummy)

3rd Assumption - Nonlinearity (functional form). The Residuals vs Fitted plot shows a nonlinear pattern. Solution: consider using ratios or percentages rather than raw data (see the module on multicollinearity for a complete discussion of the associated problems and causes).

6th Assumption - The errors are not normally distributed, as shown by the Normal Q-Q plot. Solution: look for subgroups in the data and analyze them separately, or use summary data (such as the mean value) rather than the raw data.

4th Assumption - Heteroskedasticity is occurring, as shown by the Scale-Location plot.
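
The visual diagnostics above can be backed up with formal tests. The sketch below is an illustration on simulated data, not the insurance sample: it runs a Shapiro-Wilk test on the residuals and computes a studentized Breusch-Pagan statistic by hand (n times the R-squared from regressing the squared residuals on the predictors). The variable names and the data-generating process are assumptions made for the example.

```r
set.seed(1)
n <- 200
x <- runif(n, 1, 10)
# Simulated data whose error spread grows with x (heteroskedastic by design)
y <- 5 + 2 * x + rnorm(n, sd = 0.8 * x)
fit <- lm(y ~ x)

# Normality of residuals: a small p-value suggests non-normal errors
sw <- shapiro.test(residuals(fit))

# Studentized Breusch-Pagan: regress squared residuals on the predictors;
# under homoskedasticity, n * R^2 is approximately chi-squared with
# (number of predictors) degrees of freedom
aux <- lm(residuals(fit)^2 ~ x)
bp_stat <- n * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

cat("Shapiro-Wilk p-value:", sw$p.value, "\n")
cat("Breusch-Pagan statistic:", bp_stat, " p-value:", bp_p, "\n")
```

For a fitted model with several predictors, the auxiliary regression would include all of them and the degrees of freedom would equal their count. The lmtest package offers bptest() as a packaged version of the same test.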

Question 4

Implement the solutions from question 3, such as data transformation, along with any other changes you wish. Use the sample data and run a new regression. How have the fit measures changed? How have the signs and significance of the coefficients changed?

Model 1 & 2 Against Bad Model

insurance.new$LogCharges <- log(insurance.new$Charges)
par(mfrow=c(1,2))
hist(insurance.new$Charges, main="Insurance Charges")
hist(insurance.new$LogCharges, main="Log of Insurance Charges")

Model Specification Doc 4.2 (logInsurance); Interaction Term 4-4

insurance_LogChg86Dummy <- insurance.new[,c(10,2:3,5)]
scatterplotMatrix(insurance_LogChg86Dummy, main = "Log Charges (No Dummies)")

scatterplotMatrix(insurance_86dummy, main = "Charges (No Dummies)")

model_LogCharges <- lm(LogCharges~., data = insurance.new[,c(10,2:9)])##See if we want to put a summary model here
summary(model_LogCharges)
## 
## Call:
## lm(formula = LogCharges ~ ., data = insurance.new[, c(10, 2:9)])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.65510 -0.14862 -0.05322  0.03263  1.28444 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    7.033276   0.387771  18.138  < 2e-16 ***
## Age            0.034991   0.004552   7.688 2.94e-09 ***
## BMI            0.011547   0.011824   0.977    0.335    
## Female         0.054880   0.125285   0.438    0.664    
## Children       0.063550   0.051615   1.231    0.226    
## Smoker         1.324284   0.177829   7.447 6.16e-09 ***
## WinterSprings -0.007282   0.181353  -0.040    0.968    
## WinterPark    -0.051822   0.177804  -0.291    0.772    
## Oviedo        -0.144341   0.151617  -0.952    0.347    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3972 on 38 degrees of freedom
## Multiple R-squared:  0.7924, Adjusted R-squared:  0.7487 
## F-statistic: 18.13 on 8 and 38 DF,  p-value: 8.493e-11
insurance.new$LogAge <- log(insurance.new$Age)
insurance.new$AgeSq <- insurance.new$Age^2

insurance_LogChrgAgeWDummy <- insurance.new[,c(11,10,3:9)]
insurance_LogChrgAge86Dummy <- insurance.new[,c(10,11,3,5)]
insurance_LogChrgAgeSq86Dummy <- insurance.new[,c(10,12,3,5)]

model_LogChargesNAge <- lm(LogCharges~., data = insurance_LogChrgAgeWDummy)

summary(model_LogChargesNAge)#Model: Age with a logarithmic shape
## 
## Call:
## lm(formula = LogCharges ~ ., data = insurance_LogChrgAgeWDummy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.58853 -0.17786 -0.05451  0.02616  1.27653 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.17330    0.73015   4.346 9.98e-05 ***
## LogAge         1.42556    0.18545   7.687 2.95e-09 ***
## BMI            0.01451    0.01178   1.232    0.225    
## Female         0.06560    0.12535   0.523    0.604    
## Children       0.05664    0.05168   1.096    0.280    
## Smoker         1.32511    0.17782   7.452 6.07e-09 ***
## WinterSprings -0.02476    0.18155  -0.136    0.892    
## WinterPark    -0.07879    0.17815  -0.442    0.661    
## Oviedo        -0.14899    0.15168  -0.982    0.332    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3972 on 38 degrees of freedom
## Multiple R-squared:  0.7924, Adjusted R-squared:  0.7487 
## F-statistic: 18.13 on 8 and 38 DF,  p-value: 8.507e-11
par(mfrow=c(2,2))
plot(model_LogChargesNAge)

scatterplotMatrix(insurance_LogChrgAge86Dummy, main = "Log Charges and Age (No Dummies)")

scatterplotMatrix(insurance_LogChrgAgeSq86Dummy, main = "Log Charges and Age Sqd (No Dummies)")

model_LogChrgAgeSq <- lm(LogCharges ~., data = insurance.new[,c(12,2:10)])
summary(model_LogChrgAgeSq)#Model: Age with a Quadratic Relationship
## 
## Call:
## lm(formula = LogCharges ~ ., data = insurance.new[, c(12, 2:10)])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.62995 -0.14987 -0.05370  0.02717  1.28495 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.7322920  0.9363606   7.190 1.59e-08 ***
## AgeSq         -0.0001643  0.0004640  -0.354    0.725    
## Age            0.0492269  0.0404749   1.216    0.232    
## BMI            0.0124770  0.0122478   1.019    0.315    
## Female         0.0605778  0.1277695   0.474    0.638    
## Children       0.0598072  0.0532787   1.123    0.269    
## Smoker         1.3245151  0.1799132   7.362 9.39e-09 ***
## WinterSprings -0.0149998  0.1847672  -0.081    0.936    
## WinterPark    -0.0626046  0.1824473  -0.343    0.733    
## Oviedo        -0.1470754  0.1535865  -0.958    0.344    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4018 on 37 degrees of freedom
## Multiple R-squared:  0.7931, Adjusted R-squared:  0.7428 
## F-statistic: 15.76 on 9 and 37 DF,  p-value: 3.566e-10
par(mfrow=c(2,2))
plot(model_LogChrgAgeSq)

insurance.new$LogBMI <- log(insurance.new$BMI)
insurance.new$BMISq <- insurance.new$BMI^2

insurance_LogBMIWDummy <- insurance.new[,c(13,2,4:10)]
insurance_LogChrgBMI86Dummy <- insurance.new[,c(2,5,10,13)]
insurance_LogChrgBMISq86Dummy <- insurance.new[,c(2,5,10,14)]


model_LogChrgBMIWDummy <- lm(LogCharges~., data = insurance.new[,c(2,4:10,13)])

summary(model_LogChrgBMIWDummy)#Model: BMI with a logarithmic shape
## 
## Call:
## lm(formula = LogCharges ~ ., data = insurance.new[, c(2, 4:10, 
##     13)])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.65610 -0.15185 -0.05397  0.02865  1.27595 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.313609   1.106364   5.707 1.44e-06 ***
## Age            0.034944   0.004564   7.656 3.25e-09 ***
## Female         0.056410   0.125504   0.449    0.656    
## Children       0.064999   0.051857   1.253    0.218    
## Smoker         1.323267   0.177896   7.438 6.32e-09 ***
## WinterSprings -0.005992   0.181873  -0.033    0.974    
## WinterPark    -0.045362   0.176489  -0.257    0.799    
## Oviedo        -0.140444   0.151103  -0.929    0.359    
## LogBMI         0.314013   0.330373   0.950    0.348    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3974 on 38 degrees of freedom
## Multiple R-squared:  0.7921, Adjusted R-squared:  0.7484 
## F-statistic:  18.1 on 8 and 38 DF,  p-value: 8.695e-11
par(mfrow=c(2,2))
plot(model_LogChrgBMIWDummy)

scatterplotMatrix(insurance_LogChrgBMI86Dummy, main = "Log Charges and BMI (No Dummies)")

scatterplotMatrix(insurance_LogChrgBMISq86Dummy, main = "Log Charges and BMI Sqd (No Dummies)")

model_LogChrgBMISq <- lm(LogCharges ~., data = insurance.new[,c(2:10,14)])
summary(model_LogChrgBMISq)#Model: BMI with a Quadratic Relationship
## 
## Call:
## lm(formula = LogCharges ~ ., data = insurance.new[, c(2:10, 14)])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.65467 -0.14654 -0.04853  0.03424  1.28639 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    7.111e+00  1.356e+00   5.245 6.61e-06 ***
## Age            3.502e-02  4.643e-03   7.543 5.43e-09 ***
## BMI            6.116e-03  9.201e-02   0.066    0.947    
## Female         5.393e-02  1.280e-01   0.422    0.676    
## Children       6.287e-02  5.354e-02   1.174    0.248    
## Smoker         1.324e+00  1.802e-01   7.349 9.77e-09 ***
## WinterSprings -8.169e-03  1.844e-01  -0.044    0.965    
## WinterPark    -5.396e-02  1.837e-01  -0.294    0.771    
## Oviedo        -1.452e-01  1.542e-01  -0.941    0.353    
## BMISq          9.296e-05  1.562e-03   0.060    0.953    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4025 on 37 degrees of freedom
## Multiple R-squared:  0.7924, Adjusted R-squared:  0.7419 
## F-statistic: 15.69 on 9 and 37 DF,  p-value: 3.779e-10
par(mfrow=c(2,2))
plot(model_LogChrgBMISq)

Question 5

Use the 3 withheld observations and calculate the performance measures for your best two models. Which is the better model? (remember that “better” depends on whether your outlook is short or long run)

insurance.test$LogCharges <- log(insurance.test$Charges)
insurance.test$BMISq <- insurance.test$BMI^2
insurance.test$AgeSq <- insurance.test$Age^2
insurance.test$bad_model_pred <- predict(model, newdata = insurance.test)

insurance.test$model_1_pred <- predict(model_LogChrgBMISq,newdata = insurance.test) %>% exp()

insurance.test$model_2_pred <- predict(model_LogChrgAgeSq,newdata = insurance.test) %>% exp()

# Finding the error

insurance.test$error_bm <- insurance.test$bad_model_pred - insurance.test$Charges

insurance.test$error_1 <- insurance.test$model_1_pred - insurance.test$Charges

insurance.test$error_2 <- insurance.test$model_2_pred - insurance.test$Charges

Bias

# Bad Model
mean(insurance.test$error_bm)
## [1] 2096.91
# Model 1
mean(insurance.test$error_1)
## [1] 240.616
# Model 2
mean(insurance.test$error_2)
## [1] 356.8711

MAE

# I decided to create a function to calculate this

mae <- function(error_vector){
  error_vector %>% 
  abs() %>% 
  mean()
}

# Bad Model
mae(insurance.test$error_bm)
## [1] 5282.157
# Model 1
mae(insurance.test$error_1)
## [1] 412.3407
# Model 2
mae(insurance.test$error_2)
## [1] 512.8377

RMSE

rmse <- function(error_vector){
  (error_vector^2) %>% 
    mean() %>% 
    sqrt()
}

# Bad Model
rmse(insurance.test$error_bm)
## [1] 6720.431
# Model 1
rmse(insurance.test$error_1)
## [1] 429.0247
# Model 2
rmse(insurance.test$error_2)
## [1] 584.066

MAPE

mape <- function(error_vector, actual_vector){
  (error_vector/actual_vector) %>% 
    abs() %>% 
    mean()
}

# Bad Model
mape(insurance.test$error_bm, insurance.test$Charges)
## [1] 0.6206971
# Model 1
mape(insurance.test$error_1, insurance.test$Charges)
## [1] 0.07086708
# Model 2
mape(insurance.test$error_2, insurance.test$Charges)
## [1] 0.07259645

The initial model performed worst on every measure. Between the two log-charge models, Model 1 (quadratic BMI) has lower bias, MAE, RMSE, and MAPE than Model 2 (quadratic Age). Model 1’s RMSE (429) is only slightly above its MAE (412), which indicates that none of its three predictions missed by a large amount. The choice of model can depend on your preferred time frame, but because Model 1 is better on all four measures here, it is the better model for both the short run and the long run.
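
The four measures can be computed in one call so that every model is scored the same way. The sketch below is a minimal illustration; the function name `performance` and the actual and predicted values are made up for the example, and the error is defined as predicted minus actual, as in the code above.

```r
# Hypothetical actual and predicted charges (made-up values)
actual    <- c(4000, 9000, 15000)
predicted <- c(4400, 8500, 15600)

performance <- function(predicted, actual) {
  error <- predicted - actual          # same sign convention as the report
  c(bias = mean(error),
    MAE  = mean(abs(error)),
    RMSE = sqrt(mean(error^2)),
    MAPE = mean(abs(error / actual)))
}

res <- performance(predicted, actual)
print(round(res, 4))
```

Calling performance(insurance.test$model_1_pred, insurance.test$Charges) would be the equivalent use on the withheld observations.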

Question 6

Provide interpretations of the coefficients. Do the signs make sense? Perform marginal change analysis (thing 2) on the independent variables.

#Verbal Based Response
model_LogChrgBMISq <- lm(LogCharges ~., data = insurance.new[,c(2:10,14)])
summary(model_LogChrgBMISq)
## 
## Call:
## lm(formula = LogCharges ~ ., data = insurance.new[, c(2:10, 14)])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.65467 -0.14654 -0.04853  0.03424  1.28639 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    7.111e+00  1.356e+00   5.245 6.61e-06 ***
## Age            3.502e-02  4.643e-03   7.543 5.43e-09 ***
## BMI            6.116e-03  9.201e-02   0.066    0.947    
## Female         5.393e-02  1.280e-01   0.422    0.676    
## Children       6.287e-02  5.354e-02   1.174    0.248    
## Smoker         1.324e+00  1.802e-01   7.349 9.77e-09 ***
## WinterSprings -8.169e-03  1.844e-01  -0.044    0.965    
## WinterPark    -5.396e-02  1.837e-01  -0.294    0.771    
## Oviedo        -1.452e-01  1.542e-01  -0.941    0.353    
## BMISq          9.296e-05  1.562e-03   0.060    0.953    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4025 on 37 degrees of freedom
## Multiple R-squared:  0.7924, Adjusted R-squared:  0.7419 
## F-statistic: 15.69 on 9 and 37 DF,  p-value: 3.779e-10
## The intercept is positive. Log charges increase linearly with age, and charges also increase with BMI. Charges are higher for female clients, which makes sense given pregnancy-related charges. Every listed city lowers charges relative to the default, Sanford.

## Of all the SEE x 2 tests, Children appears to show the most room for error.
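
Because the dependent variable is log(Charges), each coefficient is an approximate proportional change in charges, and the exact percentage change for a one-unit increase is 100 * (exp(b) - 1). BMI enters with a squared term, so its marginal effect depends on the BMI level: b_BMI + 2 * b_BMISq * BMI. The sketch below applies these formulas to the coefficients printed above. The evaluation point of BMI = 29 (the sample mean from Question 1) is a choice made for illustration.

```r
# Coefficients copied from the summary(model_LogChrgBMISq) output above
b <- c(Age = 3.502e-02, BMI = 6.116e-03, Female = 5.393e-02,
       Children = 6.287e-02, Smoker = 1.324, WinterSprings = -8.169e-03,
       WinterPark = -5.396e-02, Oviedo = -1.452e-01, BMISq = 9.296e-05)

# Exact percentage change in Charges for a one-unit increase
# (BMI and BMISq are handled separately because they move together)
linear_terms <- setdiff(names(b), c("BMI", "BMISq"))
pct_change <- 100 * (exp(b[linear_terms]) - 1)
print(round(pct_change, 1))

# Marginal effect of BMI on log(Charges) at a chosen BMI level
bmi_level <- 29
bmi_effect <- b["BMI"] + 2 * b["BMISq"] * bmi_level
cat("Percent change in Charges per BMI unit at BMI =", bmi_level, ":",
    round(100 * (exp(bmi_effect) - 1), 2), "\n")
```

Read this way, one more year of age raises predicted charges by roughly 3.6 percent, and the Smoker coefficient corresponds to predicted charges a little under four times those of a comparable non-smoker.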

Question 7

An eager insurance representative comes back with five potential clients. Using the better of the two models selected above, provide the prediction intervals for the five potential clients using the information provided by the insurance rep.

Customer  Age  BMI  Female  Children  Smoker  City
1         60   22   1       0         0       Oviedo
2         40   30   0       1         0       Sanford
3         25   25   0       0         1       Winter Park
4         33   35   1       2         0       Winter Springs
5         45   27   1       3         0       Oviedo
#Find Models And Run Indexed Variables
model_LogChrgBMISq
## 
## Call:
## lm(formula = LogCharges ~ ., data = insurance.new[, c(2:10, 14)])
## 
## Coefficients:
##   (Intercept)            Age            BMI         Female       Children  
##     7.111e+00      3.502e-02      6.116e-03      5.393e-02      6.287e-02  
##        Smoker  WinterSprings     WinterPark         Oviedo          BMISq  
##     1.324e+00     -8.169e-03     -5.396e-02     -1.452e-01      9.296e-05
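Clients <- data.frame(
  Age = c(60, 40, 25, 33, 45),
  BMI = c(22, 30, 25, 35, 27),
  # BMISq must hold all five squared values in one vector. The printed
  # intervals below were produced with BMISq = 22^2 = 484 for every client.
  BMISq = c(22, 30, 25, 35, 27)^2,
  Female = c(1, 0, 0, 1, 1),
  Children = c(0, 1, 0, 2, 3),
  Smoker = c(0, 0, 1, 0, 0),
  WinterSprings = c(0, 0, 0, 1, 0),
  WinterPark = c(0, 0, 1, 0, 0),
  Oviedo = c(1, 0, 0, 0, 1)
  )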
predict(model_LogChrgBMISq, newdata = Clients,interval = "prediction")
##        fit      lwr      upr
## 1 9.300244 8.376884 10.22360
## 2 8.802799 7.193021 10.41258
## 3 9.454450 8.342384 10.56652
## 4 8.696853 6.141578 11.25213
## 5 8.994086 7.727012 10.26116
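
The intervals above are on the log scale because the model predicts LogCharges. Since exp() is monotone, the endpoints can be converted to dollars directly. The sketch below does this using the values copied from the printed output; the object names are made up for the example. Under log-normal errors, exp(fit) is better read as a median than as a mean charge.

```r
# Log-scale prediction intervals copied from the predict() output above
log_pi <- matrix(
  c(9.300244, 8.376884, 10.22360,
    8.802799, 7.193021, 10.41258,
    9.454450, 8.342384, 10.56652,
    8.696853, 6.141578, 11.25213,
    8.994086, 7.727012, 10.26116),
  ncol = 3, byrow = TRUE,
  dimnames = list(paste("Client", 1:5), c("fit", "lwr", "upr"))
)

# exp() is monotone, so the interval endpoints transform directly to dollars
dollar_pi <- round(exp(log_pi))
print(dollar_pi)

# Width on the log scale shows which intervals are widest
print(sort(log_pi[, "upr"] - log_pi[, "lwr"], decreasing = TRUE))
```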

Question 8

The owner notices that some of the predictions are wider than others, explain why.

Reference Questions 1 & 2
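
One way to see why prediction intervals differ in width: the prediction standard error is s * sqrt(1 + h0), where h0 = x0' (X'X)^(-1) x0 grows as the new observation moves away from the center of the sample. The sketch below illustrates this on simulated data. The variable names and values are assumptions for the example, not the insurance sample.

```r
set.seed(42)
x <- rnorm(47, mean = 29, sd = 5.6)   # simulated predictor on a BMI-like scale
y <- 7 + 0.03 * x + rnorm(47, sd = 0.4)
fit <- lm(y ~ x)

# One new point near the sample mean, one far from it
new_points <- data.frame(x = c(mean(x), mean(x) + 15))
pi_new <- predict(fit, newdata = new_points, interval = "prediction")
width <- pi_new[, "upr"] - pi_new[, "lwr"]

cat("Width near the mean:", width[1], "\n")
cat("Width far from mean:", width[2], "\n")
```

With several predictors the same logic applies to unusual combinations of values, which is why comparing each client with the summary statistics and correlations from Questions 1 and 2 is informative.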

Verbal Response

Question 9

Are there any prediction problems that occur with the five potential clients? If so, explain.

Verbal Response

Reference Questions 1 & 2
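
A simple first check for prediction problems is whether any client falls outside the range of values seen in the sample, since predictions there would be extrapolations. The sketch below compares the client values from Question 7 with the minimum and maximum values reported in Question 1; the object names are made up for the example.

```r
# Sample ranges reported in Question 1 (n = 47)
ranges <- data.frame(
  variable = c("Age", "BMI", "Children"),
  min = c(23, 16.8, 0),
  max = c(64, 42.1, 5)
)

# Client values from the Question 7 table
clients <- data.frame(
  Age = c(60, 40, 25, 33, 45),
  BMI = c(22, 30, 25, 35, 27),
  Children = c(0, 1, 0, 2, 3)
)

# TRUE where a client's value lies outside the range seen in the sample
outside <- sapply(ranges$variable, function(v) {
  lo <- ranges$min[ranges$variable == v]
  hi <- ranges$max[ranges$variable == v]
  clients[[v]] < lo | clients[[v]] > hi
})
print(outside)
```

A variable-by-variable check like this does not detect unusual combinations of values, so it is a starting point rather than a complete answer.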